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Abstract. In 1961 Herbert Simon and Albert Ando published the theory behind the long- 
term behavior of a dynamical system that can be described by a nearly uncoupled matrix. Over 
the past fifty years this theory has been used in a variety of contexts, including queueing theory, 
brain organization, and ecology. In all these applications, the structure of the system is known and 
the point of interest is the various stages the system passes through on its way to some long-term 
equilibrium. 

This paper looks at this problem from the other direction. That is, we develop a technique for 
using the evolution of the system to tell us about its initial structure, and we use this technique to 
develop a new algorithm for data clustering. 
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1. Introduction. There is no shortage of data clustering algorithms. Indeed, 
many individual algorithms provide one or more parameters that can be set to a 
variety of values, effectively turning that single algorithm into many. Even if we 
restrict ourselves to a single algorithm with fixed starting parameters, we can still 
get varied results since methods like fc-means and nonnegative matrix factorization 
(NMF) use random initializations that can lead to different final results. 

Rather than be frustrated by this seeming inconsistency of solutions, some clus- 
tering researchers have approached this problem with the goal of using all these clus- 
terings to arrive at a single clustering solution that is superior to any individual 
solution. 

The purpose of this article is to motivate and develop a new method for merging 
multiple clustering results using theory on the behavior of nearly uncoupled matrices 
developed by Nobel laureate Herbert Simon and his student Albert Ando. 

When a collection of clustering methods is used, the collection is called an ensem- 
ble, and so this process is sometimes referred to as ensemble clustering. Others use 
the term cluster aggregation [20]. Since the goal is for these varied methods to come 
to some agreement, it is also sometimes known as consensus clustering, which will be 
the term used throughout this paper. 

The starting point for any clustering method is an m-dimensional data set of n 
elements. The data set can thus be stored as an m x n matrix A where each column 
represents an element of the data set and each row contains the value of a particular 
attribute for each of the elements. If the assignment of clusters from a single run of 
a clustering algorithm is denoted by Ck , then the input to any consensus method will 
be C = {Ci,C 2 , ■ ■ ■ ,C r }. 

One approach for solving this problem is attempting to find a clustering C* that is 
as close as possible to all the C^'s. This is an optimization problem known as median 
partition, and is known to be NP-complete. A number of heuristics for the median 
partition problem exist. Discussion of these heuristics with comparisons and results 
on real- world data sets can be found in [T4] \TE\ [21] . 
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Other researchers have brought statistical techniques to bear on this problem, 
using bootstrapping or other more general resampling techniques to cluster subsets 
of the original data set, and then examining the results using some measure of con- 
sistency to settle on the final clustering [TSJ . 

Additional approaches include a consensus framework built on a variational Bayes 
mixture of Gaussians model |23j and using algorithms originally intended for rank 
aggregation problems [2J. 

Other approaches to this problem begin by storing the information from each C)~ 
in an n x n adjacency matrix such that if data set elements i and j are in the same 
cluster according to Ck, then a\j = 1, and af^ 1 = if they are not (in this paper we 

will define = 1 for i = 1,2,..., n). The collection of these r adjacency matrices 
can be used to define a hypergraph which can then be partitioned (i.e. clustered) 
using known hypergraph partitioning algorithms [47 . 

Alternatively, this collection of adjacency matrices can be summed to form the 
consensus matrix S. Each entry of S denotes how many times elements i and j 
clustered together. For those who would prefer that all entries of S lie in the interval 
[0, 1], S can be defined as the sum of the adjacency matrices times resulting in a 
symmetric similarity matrix whose similarity measure is the fraction of the time that 
two elements were clustered together. In this paper, S will always be used to refer to 
the sum of the adjacency matrices. 

Once S is constructed, its columns can be clustered and thus the original data 
is clustered [35]. This method using single-link hierarchical clustering on S, after 
elements below a threshold have been zeroed out, has proven effective [TT] . 

A new methodology developed to cluster different conformations of a single drug 
molecule comes the closest to the approach developed in this paper. For this appli- 
cation, a Markov chain transition matrix can be created where the ij-th entry gives 
the probability the molecule changes from conformation i to conformation j. The 
goal is to then find sets of conformations such that if the molecule is currently in a 
particular set, it will remain in that set for a relatively long time. Approaches to 
this clustering problem have included examination of the first few eigenvectors of the 
transition matrix f and then improved in |12j). clustering the data based on the 
second singular vector |19[ I49j , and spectral analysis of a family of Hermitian matrices 
that is a function of the the transition matrix |25j . 

2. A new approach. The data clustering method introduced in this paper is 
based on the 1950's variable aggregation work of the Nobel prize winning economist 
Herbert Simon and his graduate student Albert Ando [41]. Their theory will be 
reviewed in Section [2~T] and further theoretical work will be developed in Sections |2.2| 
- |2.4| before the algorithm is introduced in Section [3] 

2.1. Theoretical background. Simon-Ando theory was originally designed as 
a way of understanding the short and long term behavior of an economy with a 
certain structure. Figure |2.1| illustrates a simple system where Simon-Ando theory 
would apply. 

Such a closed economic system, without any outside influences, is known to even- 
tually reach a state of equilibrium, that is, after some initial fluctuations, the flow 
of goods and capital between any two industries will remain more or less constant. 
Rather than waiting for this economic equilibrium to occur, Simon and Ando tried 
to predict the long-term equilibrium by making only short-term observations. They 
proved that what happens in the short run completely determines the long-term equi- 
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Fig. 2.1: This figure illustrates a simple Simon- Ando system and how it would be 
represented in matrix form. Let the circles on the left represent three small countries. 
The graphs within each circle represent companies in those countries and the solid lines 
between them represent a large amount of capital exchange between the companies. 
The dashed lines represent a small amount of cross-border exchange. A matrix whose 
entries represented the amount of economic activity between any two companies in 
this system would look like the one on the right with the shaded areas being dense 
with relatively large values and the epsilons being relatively small. 



librium. 

Over the years scholars in a variety of disciplines have realized the usefulness of 
a framework that represents a number of tightly-knit groups that have some loose 
association with each other, and Simon-Ando theory has been applied in areas as 
diverse as ecology |28| . computer queueing systems [S], brain organization [45] . and 
urban design [40 . Simon himself went on to apply the theory to the evolution of 
multicellular organisms |4l2"] . 

The n x n matrix S is called uncoupled if it has the form 

/ Su ... \ 

o s 22 ... o 

5= .... , 
V ...S kk J 

where the diagonal blocks Su are square. If S is not uncoupled for any value of 
k > 2 and if entries in the off-diagonal blocks are small relative to those in the 
diagonal blocks, then we say that S is nearly uncoupled. The matrix in Figure [2~T| is 
an example of a nearly uncoupled matrix. A more formal measure of uncoupledness 
will be introduced in Definition 12. 121 

If the consensus matrix S described in the Introduction is nearly uncoupled, we 
will show that Simon-Ando theory can be used to cluster the data it describes. Notice 
that S is symmetric and this combined with it not being uncoupled means S is also 
irreducible. For reasons that will soon become apparent, the new clustering method 
will require that S be converted to doubly stochastic form. This new matrix will be 
called P and the data clustering method will depend on P having a unique stationary 
distribution vector (which is guaranteed by irreducibility) and a known structure 
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(which is guaranteed by double stochasticity) . 

Before we can use P to cluster data we need to introduce the concept of stochastic 
complement at ion . 

If P is stochastic then each diagonal block Pa has a stochastic complement defined 

by 

(2.1) C u = P ll + P l *(I-PiT 1 P* l , 

where Pj is the matrix obtained by deleting the ith row and ith column of blocks from 
P, Pi* is the ith row of blocks of P with Pa removed, and P« is the ith column of 
blocks of P with Pa removed. Since every principal submatrix of I — P of order n — 1 



or smaller is a nonsingular Af-matrix, the matrix (I — Pj) found in (2.1 1 is defined 
and (I — Pi)^ 1 > 0. Furthermore, if P is stochastic and irreducible, then each Cu is 
itself a stochastic, irreducible matrix with stationary distribution vector cf [41 152"]. 
Let xf be a probability row vector and consider the evolution equation 

(2.2) xf = xJ_ x P 
or its equivalent formulation 

(2.3) xJ = x^P\ 

Simon-Ando theory asserts that xf will pass through distinct stages as t grows 
to infinity. Meyer describes how these stages can be interpreted in terms of the 
individual stationary distribution vectors cf . The following lemma and theorem will 
aid in extending that explanation to the case where P is doubly stochastic. The 
proof of the lemma is a direct application of principles of permutation matrices and 
is omitted. 

Lemma 2.1. Let P be an n x n irreducible doubly stochastic matrix in which 
the diagonal blocks are square. Let Q be the permutation matrix associated with an 
interchange of the first and ith block rows (or block columns) and let P be defined as 

P = QPQ. 

If P is partitioned into a 2 x 2 block matrix 

(2.4) P =( B 11 ? 2 ) where = P ^ 
then the stochastic complement of Pa is 

(2.5) C u = C u = Pu + Pa (i - P22Y 1 P21 



Theorem 2.2. If 



P = 



( Pn 
P21 



P12 
P22 



V Pkl Pk2 



Plk \ 

P2k 



Pkk J 



is an irreducible doubly stochastic matrix, then each stochastic complement is also an 
irreducible, doubly stochastic matrix. 
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Proof. As stated earlier, if the stochastic matrix P is irreducible, then so are each 
of its stochastic complements. Therefore, we need only prove that each Su is doubly 
stochastic. For a giv en i, suppo se d iagonal block Pa has been repositioned such that 



Pn = Pa as in (2.4) of Lemma 



2.1 



Let e represent a column vector of all ones. Both the row and column sums of P 
are one, so allowing the size of e to be whatever is appropriate for the context, the 
following four equations are true 



(2.6) 



Pne + P 12 e = e 



(2.7) 



P 2 ie + P 22 e = e 



(2.8) 



e T Pn + e T P 2 i = e T 



(2.9) 



e T P 12 + e T P 22 = e T 



Equations |2.7| and |2.9| can be rewritten to yield 



e = [l - 



(/ - P22) 



-l _ 



P2ie and e = e P] 



12 



(/-P 22 ) _1 



As noted earlier, (/ — P 22 ) 1 > 0, and hence 



Cu=Pu+P 12 (I-P22) P21 >0 



-1 _ 



Multiplying C\\ on the right by e and on the left by e T yields 



die = Pi ie + Pi 



(i - P22) 



-1 _ 



p2ie = Pne + Pi 2 e = e 



and 



e T Cn = e T P n + e T P r . 



(/ - P22) 



-1 _ 



P21 = e J Pn + e J P 21 



Therefore, since Cn — Cu, each stochastic complement is doubly stochastic. □ 

Markov chain theory tells us that as t — > 00, xj will approach the uniform dis- 
tribution vector (1/n 1/n ... 1/n). If the size of each Pa is rii x n^, we also know 
that c^" = (1/rii 1/rii ... 1/rii). 

As t increases from zero, xf initially goes through changes driven by the compar- 
atively large values in each Pa- Once these changes have run their course, the system 
settles into a period of short-term stabilization characterized by 



(aiCi a 2 c 2 . . . a k c k ) 



Oil Oil 
Til TT-l 



III 



a 2 a 2 
n 2 n 2 



a 2 
n 2 



ate «fc 



Ok 

n k 



where each on is a constant dependent on x q . 
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After this equilibrium period, the elements of xf begin to change again through 
a period called middle-run evolution, this time being affected by the small values in 
the off-diagonal blocks, but the change is predictable and can be described by 



x t fa (/3jci P2C2 ■ ■ ■ PkCk) 



where each /3i is dependent on t. 

Simon and Ando were not interested in clustering data. For them, the importance 
of stages like short-term stabilization and middle- run evolution lie in the fact that even 
for small values of t, the structure of xf reflected the stationary probability vectors 
of the smaller Cu matrices. From there, examination of the xf vector during the 
relatively stable periods would allow for determination of these smaller stationary 
probability vectors and facilitate the calculation of the stationary probability vector 
for P. 

For cluster analysis however, the focus is turned around. Since we will be using 
doubly stochastic P matrices, we already know that the stationary probability vector 
is the uniform probability vector. We also know that each diagonal block Pa is 
associated with a uniform probability vector related to its stochastic complement. 
Identification of the clusters then comes down to examining the entries of xf. The 
key is to look for elements of xf that are approximately equal. The only difference 
between short-run and middle-run is whether the elements of xf stay at approximately 
the same value for a number of iterations or move together towards the uniform 
probability distribution. 

All the development in this section assumed a doubly stochastic matrix. We will 
now consider how to convert a matrix into doubly stochastic form, and show that the 
process does not destroy any of the desirable characteristics of our matrix. 

2.2. Sinkhorn-Knopp. The process of converting a matrix into doubly stochas- 
tic form has drawn considerable attention, and in f964 Sinkhorn showed that any 
positive square matrix can be scaled to a unique doubly stochastic matrix |43j . This 
result can be extended to nonnegative matrices as long as the zero entries are in just 
the right places. An understanding of this zero structure will require some definitions. 

Definition 2.3. (Sinkhorn and Knopp 144V A nonnegative nxn matrix S is said 
to have total support if S 7^ and if every positive element of S lies on a positive di- 
agonal, where a diagonal is defined as a sequence of elements Swilj S2<r(2)7 • ■ • ; s n „( n ) 
where a is a permutation of {1, 2, . . . , n}r\ 

Definition 2.4. (Mine JJ$ , p. 82) An nxn matrix S is partly indecomposable 
if there exist permutation matrices P and Q such that 



PSQ 



X Z 
Y 



where X and Y square. If no such P and Q exist, then S is fully indecomposable. 

Definition 2.5. (Mine J3$ , p. 82) Two matrices A and B are permutation 
equivalent, or p-equivalent, if there exist permutation matrices Q and Q such that 
A = QBQ. 



1 Notice that by this definition of diagonal, the main diagonal of a matrix is the one associated 
with the permutation a = (1 2 3 ... n). 
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This new terminology will help in understanding the following, nearly identical 
theorems that were independently proven and then published within a year of each 
other, the first in 1966 and the second in 1967. 

Theorem 2.6. (Brualdi, Parter, and Schneider [$]) If the n x n matrix A is 
nonnegative and fully indecomposable, then there exist diagonal matrices D\ and Di 
with positive diagonal entries such that D1AD2 is doubly stochastic. Moreover D\ 
and D 2 are uniquely determined up to scalar multiples. 

Theorem 2.7. (Sinkhorn and Knopp 144V If the n x n matrix A is nonnegative, 
then a necessary and sufficient condition that there exists a doubly stochastic matrix 
of the form D\ADi where D\ and D2 are diagonal matrices with positive diagonal 
entries is that A has total support. If D1AD2 exists, then it is unique. Also D\ and 
D 2 are unique up to a scalar multiple if and only if A is fully indecomposable. 

The uniqueness up to a scalar multiple of Di and D 2 mentioned in both theo- 
rems means that if E\ and E2 are also diagonal matrices such that E1AE2 is doubly 
stochastic, then E\ = atD\ and E2 = fi~Di where a/3 = 1. 

The way that the consensus similarity matrix S is constructed guarantees its non- 
negativity, so the only thing standing in the way of knowing that the scaling matrices 
Di and D 2 exist is showing that S either has total support or is fully indecompos- 
able. Reviewing the definitions of these terms, neither of these tasks seems inviting. 
Fortunately, there is a theorem that will simplify the matter. 

Theorem 2.8. (Mine 1341, p. 86) A nonnegative matrix is fully indecomposable if 
and only if it is p-equivalent to an irreducible matrix with a positive main diagonal. 

S is trivially p-equivalent to itself since S — I SI and S is an irreducible matrix 
with a positive main diagonal. Now that we know S is fully indecomposable, its 
symmetry is going to guarantee another useful result. The proof of the following 
lemma is included since there was a typographical error in the original paper. 

Lemma 2.9. (Csima and Datta ilO\[ ) Let S be a fully indecomposable symmetric 
matrix. Then there exists a diagonal matrix D such that DSD is doubly stochastic. 

Proof. Let Di and D 2 be nonnegative diagonal matrices such that DiSD 2 is 
doubly stochastic. Then {DiSD2) T = D2SD1 is also doubly stochastic. By the 
uniqueness up to a scalar multiple from Theorems |2.6| and |2.7[ we know D2 = otD\ 
and D\ — f5D%. Using the first of these facts 

D X SD 2 = D l SaD l 

= y/b^D 1 S^D 1 
= DSD 

shows us that D = yfa D±. □ 

2.3. The structure of DSD. We will use P as the symbol for the doubly 
stochastic matrix derived from S, that is P = DSD. For simplicity of notation, the 
i th diagonal entry of D will be denoted di. We will show that P has the same desirable 
properties that S has. 

Lemma 2.10. If S is an n x n fully indecomposable irreducible matrix and P = 
DSD is doubly stochastic, then P is irreducible. 

Proof. Since S is irreducible, there is no permutation matrix Q such that 

QSQ T = 



X Z 
Y 



where both X and Y are square. 



8 



C. D. MEYER AND C. D. WESSELL 



Thus the only way that P = DSD could be reducible is if the zero structure of S 
is changed by the multiplication. But notice that since Pij = didjSij and both di and 
dj are positive, p^ = only when Sij — 0. So the zero structure does not change, and 
P is irreducible. □ 

Since the number of times elements i and j cluster with one another is necessarily 
equal to the number of times elements j and i cluster with one another, the symmetry 
of the consensus similarity matrix S reflects a real-world property of the consensus 
clustering problem and so it is important that symmetry is not lost when S is converted 
into P. 

Lemma 2.11. If S is an n x n fully indecomposable symmetric matrix and P = 
DSD is doubly stochastic, then P is symmetric. 
Proof. 

(2.10) P T = (DSD) T = DS T D = DSD = P 

□ 

We wish to prove that if S is nearly uncoupled, then so is P. To do so we first need 
a formal definition of near uncouplcdncss. Then we will show how this uncoupling 
measure for P is related to the uncoupling measure of S. 

Definition 2.12. Let n\ and n2 be fixed positive integers such that n\+U2 = n, 
and let S be an nxn symmetric, irreducible matrix whose respective rows and columns 
have been rearranged to the form 

Sn S12 
S21 S22 

where Sn is m x ni and S22 is n 2 x n 2 so that the ratio 

e T Si 2 e + e T S 2 ie _ 2e T Si 2 e 
T Se ~ e T Se 



cr(S,ni) = 



is minimized over all symmetric permutations of S. The quantity a{S,ni) is called 
the uncoupling measure of S with respect to parameter m. In other words a(S,ni) 
is the ratio of the sum of the elements in the off-diagonal blocks to the sum of all the 
matrix entries. 

Before moving on, two points should be made clear. First, there is no arbitrary 
uncoupling measure value below which a matrix is deemed to be nearly uncoupled. 
Rather, a(S, ni) is a relative value whose meaning is dependent on the uncoupling 
measures of S using other choices of ni or on comparisons with other similarity ma- 
trices a researcher has experience with. Second, exact calculation of the uncoupling 
measure for all but very small problems is not feasible, but its theoretical value is im- 
portant since it allows us to compare matrices S and P as the the following theorem 
shows. 

Theorem 2.13. If S is the n x n consensus matrix created from r clustering 
results, then for the doubly stochastic matrix P = DSD, <r(P, m) < ^a(S,ni), 
where S = e T Se. 

Proof. By the way we constructed S, su = r for i = 1, 2, ... , n. Since pa — didiSu 
and pa < 1, it follows that dfr implies rf, < 

If we impose the same block structure on D that exists for S, that is 



D 



Di 
£> 2 
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and recall that P is doubly stochastic, 

, D 2e T D 1 S 12 D 2 e 
n 

Since each element of D\ and D 2 is less than -4=, 

(^) 2 {2e T S 12 e) E 
n nr 

and the bound is established. □ 

2.4. The spectrum of P. Consider the following facts about the eigenvalues 
off. 

1. Since P is stochastic, all of its eigenvalues lie on or inside the unit circle of 
the complex plane. 

2. Since P is real-symmetric, all of its eigenvalues are real. Combined with the 
last fact, this means all eigenvalues of P reside in the interval [—1, 1]. 

3. The largest eigenvalue of P is one, and since P is irreducible, that eigenvalue 
is simple (i.e. it appears only once). 

4. Aj(P) 7^—1 for all i because P is a primitive matrix. P is primitive because 
it is irreducible and has at least one positive diagonal element ([33], p. 678). 

Unlike Markov chain researchers who desire a small second eigenvalue since it leads 
to faster convergence when calculating the chain's stationary distribution vector, we 
want a second eigenvalue near one. Slow convergence is a good thing for us since it 
allows time to examine the elements of x t as it passes through short-term stabilization 
and middle- run evolution. Also, A 2 (P) ~ 1 may indicate that the matrix is nearly 
uncoupled [4"B"j . 

We will now show that \%{P) ~ 1 along with other properties of P guarantees that 
P is nearly uncoupled. First, observe the following lemma whose proof is self-evident. 
Lemma 2.14. Let {Pk} be a sequence of matrices with limit Pq. Then, 

1. If each matrix in {Pk} is symmetric, Pq is symmetric, and 

2. If each matrix in {Pk} is stochastic, Pq is stochastic. 

Theorem 2.15. For a fixed integer n > 0, consider the n x n irreducible, sym- 
metric, doubly stochastic matrix P. Given e > 0, there exists a 5 > such that if 
a{P,m) < S, then |A 2 (P) — 1| < e. In other words, if P is sufficiently close to being 
uncoupled, then X 2 (P) rs 1. 

Proof. Two proofs will be presented. The first relies on a continuity argument, 
while the second gives an explicit bound on \X 2 (P) — 1|. 

Proof (1): Let e > 0. Consider a sequence of irreducible, symmetric, doubly 
stochastic matrices 

p(fc) p(fc) 

^11 -M.2 
p(fc) p(fc) 
r 21 r 22 

defined so that lim a(Pk,ni) = 0. The Bolzano- Weierstrass theorem ([3J, p. 155) 

k— foo 

guarantees that this bounded sequence has a convergent subsequence Pk 1 , Pk 2 , ■ ■ ■ 
which converges to a stochastic matrix T whose structure is 



T = 



Tn 
T 22 



Tu ^0,T 22 ^0, 
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where each Tu is stochastic. By the continuity of eigenvalues, there exists a positive 
integer M such that for fej > M, 

|A 2 (P fei ) - A 2 (T)| < e |A 2 (P fe J - 1| < e, 

and the theorem is proven. 

Proof (2): Suppose that the rows and respective columns have been permuted so 

that 

Pn Pyi 
P21 P22 

where P is nearly uncoupled, and define C to be the nx n block diagonal matrix with 
the stochastic complements of Pn and P22 on the diagonals, that is 



C = 



Cu 
C22 



If E is defined to make the equation C = P + E true, then a consequence of the 
Courant-Fisher Theorem can be used ([33], pp. 550-552) to show that for any matrix 
norn0 

A 2 (P) ~ < 1 < A 2 (P) + \\E\\ -+ |1 - A 2 (P)| < \\E\\. 

□ 

Theorem 2.16. For a fixed integer n > 0, consider the n x n irreducible, sym- 
metric, doubly stochastic matrix P. Given e > 0, there exists a 8 > such that if 
|A 2 (P) — 1| < 5, then a(P,ni) < e for some positive integer ni < n. In other words, 
if A 2 (P) is sufficiently close to 1, then P is nearly uncoupled. 

Proof. The argument is by contradiction and similar to one used in |24j . Suppose 
there is an e > such that for any 8 > there is an n x n irreducible, symmetric, 
doubly stochastic matrix P with |A 2 (P) — 1| < 8 and c(P, rti) > e for all for positive 
integers ni < n. For 8 = J let P& be such a matrix. There must be a subsequence 
Pi 17 Pi 2 ,... which converges, say to P . Then P must have A 2 (Po) = 1 and thus 
<r(Po,ni) = 0. Yet, a(Po,ni) = lim/^oo <r(Pt, n\) > e, a contradiction. □ 

Although we previously defined an uncoupling measure for a general matrix in 



Section 2.3 for doubly stochastic matrices this theorem allows us to use A 2 as an 
uncoupling indicator with a value near one signifying almost complete uncoupling. 

There may be additional eigenvalues of P that are close to one. This group of 
eigenvalues is called the Perron cluster [111 I12j , and in the case where all eigenvalues 
are real the Perron cluster can be defined as follows. 

Definition 2.17. Let P be an n x n symmetric, stochastic matrix with eigenval- 
ues, including multiplicities, of 1 = Ai > A 2 > A3 > . . . > A„. // the largest difference 
between consecutive eigenvalues occurs between and Xk+i, the set {1, ...A&} is 
called the Perron cluster of P. If two or more pairs of eigenvalues each have differ- 
ences equal to the largest gap, use the smallest value of k to choose A&. The larger 
the gap, the more well-defined the cluster. 

Some researchers use the number of eigenvalues in the Perron cluster as the num- 
ber of clusters they search for [TTJ Q15] . This inference is a natural extension of The- 
orems [2T5] and [2T6J that is if P had k eigenvalues sufficiently close to 1, then P is 



2 If the 2-norm is used the bound is |1 — A2(P)| < 2 v / ncr(P, ni). We thank Use Ipsen for this 
observation. 
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Stochastic Clustering Algorithm (SCA) 

1. Create the consensus similarity matrix S using a clustering ensemble of 
user's choice. 

2. Use matrix balancing to convert S into a doubly stochastic symmetric ma- 
trix P. 

3. Calculate the eigenvalues of P. The number of clusters, k, is the number 
of eigenvalues in the Perron cluster. 

4. Create a random Xq. 

5. Track the evolution xj — xf_ 1 P. After each multiplication, sort the the 
elements of xf and then separate the elements into k clusters by dividing 
the sorted list at the k — 1 largest gaps. Alternatively, the elements of 
Xt can be clustered using fc-means or any other widely available clustering 
method. When this clustering has remained the same for a user-defined 
number of iterations, the final clusters have been determined. 



Fig. 3.1: The Stochastic Clustering Algorithm 



nearly uncoupled with k dominant diagonal blocks emerging after an appropriate per- 
mutation QPQ T . This is also the approach we will take with the stochastic clustering 
algorithm. Unlike with the vast majority of clustering methods, the user will not have 
to tell the algorithm the number of clusters in the data set unless they explicitly want 
to override the algorithm's choice. Instead, the stochastic clustering algorithm will 
set k equal to the size of the Perron cluster. 

3. Putting the concept into practice. Now that the theoretical underpin- 
nings are in place, it is time to formally describe the stochastic clustering algorithm. 

The algorithm takes as input the consensus similarity matrix S which the user has 
created from whatever combination of clustering methods and/or parameter settings 
they choose. S is then converted into the doubly stochastic matrix P using the 
Sinkhorn-Knopp algorithm. All eigenvalues are computed, and the Perron cluster of 
P is identified. Eigenvalues of symmetric matrices can be efficiently computed [37], 
but if finding all eigenvalues is too costly, the user, with knowledge of the underlying 
data set, can direct the program to find only the k largest eigenvalues (k > k). The 
size, k, of the Perron cluster of these k eigenvalues is then used by the stochastic 
clustering algorithm to separate the data into k clusters. 

Starting with a randomly generated Xq,xJ — xf_ x P is evaluated for t = 1, 2, 

After each calculation, the entries of xf are sorted, the k— 1 largest gaps in the sorted 
list identified and used to divide the entries into k clusters. When the k clusters have 
been identical for n iterations, where n is a user-chosen parameter, the program stops 
and the clusters returned as output. Figure [3~T] summarizes the algorithm. 

3.1. A Small Example. Consider the following small data matrix which in- 
cludes the career totals in nine statistics for six famous baseball players (the row 
labels stand for Games, Runs, Hits, Doubles, Triples, Home Runs, Runs Batted In, 
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Stolen Bases, and Bases on Balls). 





Rose 


Cobb 


1 isk 


Utt 


Ruth 


TV /r~ 

A 1 ays 


G 


/ 3562 


3034 


2499 


2730 


2503 


2992 \ 


R 


2165 


2246 


1276 


1859 


2174 


2062 


H 


4256 


4189 


2356 


2876 


2873 


3283 


2B 


746 


724 


421 


488 


506 


523 


A= 3B 


135 


295 


47 


72 


136 


140 


HR 


160 


117 


376 


511 


714 


660 


RBI 


1314 


1938 


1330 


1860 


2213 


1903 


SB 


198 


897 


128 


89 


123 


338 


BB 


\ 1566 


1249 


849 


1708 


2062 


1464 j 



Those familiar with baseball history would mentally cluster these players into 
singles hitters (Rose and Cobb), power hitters (Mays, Ott, and Ruth) and, a great 
catcher who doesn't have enough home runs and runs batted in to fit with the power 
hitters nor the long career and large number of hits to fit with the singles hitters 
(Fisk). 

The consensus similarity matrix was built using the multiplicative update version 
of the nonnegative matrix factorization algorithm 27J. Since it isn't clear whether two 
or three clusters would be most appropriate, S was created by running this algorithm 
50 times with k = 2 and 50 times with k = 3. The resulting similarity matrix is 



S = 







Rose 


Cobb 


Fisk 


Ott 


Ruth 


Mays 


Rose 




100 


67 


73 


2 





2 \ 


Cobb 




67 


100 


50 


1 


2 


7 


Fisk 




73 


50 


100 


15 


9 


24 


Ott 




2 


1 


15 


100 


92 


82 


Ruth 







2 


9 


92 


100 


77 


Mays 


V 


2 


7 


24 


82 


77 


100 / 



With a small example like this, especially one where the players that will cluster 
together have been purposely placed in adjacent columns, it would be simple enough 
to cluster the players through a quick scan of S. However, following the algorithm to 
the letter we apply the Sinkhorn-Knopp algorithm. The resulting doubly stochastic 
matrix (rounded to four places) is 



P = 





Rose 


Cobb 


Fisk 


Ott 


Ruth 


Mays 


Rose 


/ 0.4131 


0.2935 


0.2786 


0.0075 





0.0075 \ 


Cobb 


0.2935 


0.4644 


0.2023 


0.0040 


0.0082 


0.0277 


Fisk 


0.2786 


0.2023 


0.3525 


0.0517 


0.0323 


0.0826 


Ott 


0.0075 


0.0040 


0.0517 


0.3374 


0.3233 


0.2761 


Ruth 





0.01082 


0.0323 


0.3233 


0.3660 


0.2701 


Mays 


\ 0.0075 


0.0277 


0.0826 


0.2761 


0.2701 


0.3361 / 



The eigenvalues of P are 1.0000, 0.8670, 0.2078, 0.1095, 0.0598, and 0.0254 sug- 
gesting that there are two clusters in this data. 

Table |3.1| shows the results from a sample run of the clustering method. The 
initial probability vector Xq was chosen randomly, and the table shows the value of 
xf and the corresponding clusters for the next seven steps of the algorithm. Since 
k = 2, the clusters are determined by ordering the entries of xj ', finding the largest 
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gap in this list, and clustering the elements on either side of this gap. For example, 
at the t = 6 step shown in the table, the largest gap in the sorted list is between 
0.1609 and 0.1715. This leads to the numerical clustering of {0.1597,0.1600,0.1609} 
and {0.1715,0.1739,0.1741} which translates to the clustering {Rose, Cobb, Fisk} 
and {Ott, Ruth, Mays}. 



Table 3.1: Following the Stochastic Clustering Algorithm for the Small Example 



t 


T 
x t 


Clusters 





( 0.2334 0.2595 0.0364 0.2617 0.1812 0.0279 ) 


{Rose, Cobb, Ott, Ruth} 
{Fisk, Mays} 


1 


( 0.1848 0.1997 0.1520 0.1592 0.1618 0.1425 ) 


{Rose, Cobb} 
{Fisk, Ott, Ruth, Mays} 


2 


( 0.1795 0.1836 0.1707 0.1554 0.1557 0.1550 ) 


{Rose, Cobb, Fisk} 
{Ott, Ruth, Mays} 


3 


( 0.1779 0.1787 0.1732 0.1565 0.1561 0.1576 ) 


{Rose, Cobb, Fisk} 
{Ott, Ruth, Mays} 


4 


( 0.1765 0.1765 0.1729 0.1578 0.1574 0.1589 ) 


{Rose, Cobb, Fisk} 
{Ott, Ruth, Mays} 


5 


( 0.1752 0.1751 0.1722 0.1590 0.1586 0.1600 ) 


{Rose, Cobb, Fisk} 
{Ott, Ruth, Mays} 


6 


( 0.1741 0.1739 0.1715 0.1600 0.1597 0.1609 ) 


{Rose, Cobb, Fisk} 
{Ott, Ruth, Mays} 


7 


( 0.1731 0.1729 0.1709 0.1609 0.1606 0.1616 ) 


{Rose, Cobb, Fisk} 
{Ott, Ruth, Mays} 



From t = 2 on the clusters remain the same. The SCA defines the stopping 
condition as a user-defined number of consecutive identical clusterings. If that number 
is six, then the final clustering of {Rose, Cobb, Fisk} and {Ott, Ruth, Mays} is 
determined when t = 7. For the reader wondering if the clustering changes at some 
later point, the algorithm was run through t = 1000 and the same clustering was 
found at each step. 

4. Implementation. As is to be expected with a new algorithm, actual imple- 
mentation of ideas that looked fine on paper can still be problematic. Even before 
implementation, there may be concerns about perceived weak links in the algorithm. 
In this section we will address some of these concerns. Since this section and the 
results section involve many of the same issues, it will be hard to talk about them 
without highlighting some of the results to come. Hopefully, no great surprises are 
spoiled, and the turning of pages back and forth is kept to a minimum. 

4.1. Impact of initial probability vectors. The fact that the stochastic clus- 
tering algorithm depends on a random initial probability vector (IPV) raises the 
question of whether all random probability vectors will lead to the same clustering. 
Since P is irreducible, we are guaranteed that the matrix has a unique stationary 
distribution vector that is independent of the IPV. But, for clustering purposes, that 
is not the issue. Instead we would like to have confidence that for a certain IPV, xf 
will remain in short-term stabilization and middle-run evolution long enough for us 
to identify the clusters. Secondly, as we will see soon in Section [5j different IPVs can 
lead to different cluster results. 

We will consider the IPV question in two parts. First we address the rare occur- 
rence of an IPV that does not lead to a clustering at all, and then we address the fact 
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that different IPVs can lead to different clusterings. 

4.2. IPVs leading to no solution. Clearly not every initial probability vector 
will help us in data clustering. Suppose, for example, that 



111 1 

n n n n 



lxn 



Since P nX n is doubly stochastic, Xq is its stationary distribution vector. With such a 
choice for the IPV, xj never changes and we have no ability to group the probabilities 
in xf in order to cluster the original data. 

It is simple enough to make sure that Xq is not the uniform distribution vector, 
but it is equally important that there are enough iterations for the algorithm to 
recognize either short-term stabilization or middle-run evolution before xf reaches 
the uniform vector. Since each new xj is the result of the continuous operation of 
matrix multiplication, xf being close to the uniform distribution vector, ensures that 
xj +1 can not be significantly further away for it. Therefore, even though the algorithm 
generates Xq randomly, the cautious user may want to set a tolerance e and if 

||a£-(l/n 1/n ... l/n)\\ < e, 

generate another Xq. It should be noted that in the preparation of this paper the 
stochastic clustering algorithm was run hundreds, if not thousands, of times and never 
was a failure due to an IPV being too close to the uniform distribution. 

4.3. IPVs leading to different solutions. The fact that cluster analysis is 
an exploratory tool means that getting different solutions depending on the initial 
probability vector is not the end of the road, but rather an opportunity to examine 
these solutions in the hope of gaining additional insight into the data set's structure. 

That said, it would still be instructive to know as much as possible about the 
characteristics shared by IPVs that lead to the same solution, how many different 
solutions are possible, and how often each of them is likely to appear. Probabilistic 
analysis of random starting vectors has been done in the context of iterative methods 
for finding eigenvalues and eigenvectors [T31 [25], and is a natural area for further 
research on the stochastic clustering method. 

4.4. Using a single measure. The workload in consensus clustering is concen- 
trated at the beginning of the process when the large number of clustering results are 
computed. Even if a user has access to a multiprocessor environment where this work 
can be shared, it would be advantageous to find a single similarity measure which is 
compatible with the stochastic clustering algorithm. 

Since the SCA is inspired by Simon-Ando theory, the underlying matrix must be 
nearly uncoupled. For a given data set, the problem with most traditional similarity 
(or dissimilarity) measures is that their values tend to the middle of their range. 
To illustrate, consider two common similarity measures: Euclidean distance and the 
cosine measure 

x\xi 

C{Xi,X 2 ) = J, rr-r, rr-. 

iFilla 1F2II2 

The former has the advantage of being familiar to almost everyone, while the latter 



has been found to be particularly useful in text-mining jo]. However, as Figures 4.1 



and 4.2 show for the leukemia DNA microarray data set that will be introduced in 
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Section [5. 2[ the distribution of values returned by these two common measures is not 
the kind of distribution needed to form a nearly uncoupled matrix. 

In the case of the cosine measure whose range is [0, 1], there have been attempts 
to "massage" distributions so that they contain more values near the extremes. Such 
methods often involve changing small values to zero and then performing some arith- 
metic operation that gives the remaining data a larger variance (for example, squaring 
each value) |50] , These methods, however, are far from subtle and in experiments for 
use with the SCA, the matrix P went from dense to too sparse for clustering in one 
iteration of attempting to adjust its values. 

Working with the Euclidean norm brings with it the additional requirement large 
distances need to be mapped to small similarity values while small distances are 
mapped to large similarity values. A typical function used in making this translation 
is a Gaussian of the form 

f{xi,x 2 )=e 

where a is a parameter that typically has to be adjusted for each similarity matrix 
|36) . This is certainly an area for future study in implementing the SCA, but so far 
a reliable way to build a matrix of Gaussians with the distribution required by the 
SCA has not been found. 

It should be noted that power iteration clustering introduced by Lin and Cohen 
has succeeded in using a single measure to cluster data using an algorithm similar 
in philosophy to the SCA. This method uses a row-stochastic Laplacian-like matrix 
derived from a similarity matrix constructed using the cosine similarity measure [291 
1501 |3"T] . Like the SCA, clusters are determined by examining intermediate iterates of 
the power method. It is interesting to note despite mentioning a Gaussian approach 
to the Euclidean norm in |29j . all results in the paper were obtained using either a 
— 1 or cosine measure. 




5 15 25 35 45 55 65 75 85 



Fig. 4.1: This is the histogram of the 703 similarity values used to build a consensus 
matrix for the 38-element leukemia DNA microarray data set that will be introduced 
in Section [5.21 The horizontal axis measures the number of times out of 100 that two 
elements clustered together. The histogram shows that pairs of data points clustered 
together either a small or large number of times. 

A single measure that has been used with some success involves the idea of nearest 
neighbors, those data points closest to a given data point using a specific distance 
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0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 




0.2 0.3 0.4 0.5 0.6 0.7 0.8 0. 



Fig. 4.2: The histogram on the left shows the distribution of cosine similarity measures 
between the same elements used for Figure |4.1| while the histogram on the right 
does the same for Euclidean norm values scaled to the interval [0, 1]. Contrast these 



distributions with the one shown in Figure 4.1 



measure. For each element g in the data set, let the set M g consist of the k nearest 
neighbors of g, where the user chooses both the positive integer k and the distance 
measure used. The Sjj element of the consensus matrix is equal to the number of 
elements in Wj U Afj [J. 

Work with consensus matrices built in this fashion is still in its initial stages. It 
has become obvious that the choice of K and the distance measure greatly affect the 
results as can be seen in Table Ht] 



Table 4.1: Building a consensus matrix based on the number of shared nearest neigh- 
bors can work well or poorly depending on the value of k, the number of nearest 
neighbors calculated for each data point. The results in this table are from clustering 
the rather simple, four-cluster Ruspini data set [39 . When k = 15 the stochastic 
clustering algorithm detects five clusters. This fifth cluster only has one member, 
while the rest of the solution is correct. 



K 


Clusters 


Errors 


15 


5 


1 


20 


4 





25 


4 


18 



5. Results. In building test cases for our proposed algorithm, one complication 
is determining the ensemble used to build the initial similarity matrix S. In the 
results that follow the ensembles will typically consist of multiple runs of multiplicative 
update version of NMF [37J or fc-means or a combination of bothj^] In each case, 
the value or values of k used when calling these algorithms will be noted, though 
as explained above the new stochastic clustering algorithm will use the number of 
eigenvalues in the Perron cluster of P to determine k. 



3 For an example of how the factors found by NMF are used to cluster data see [8]. 
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5.1. Iris data set. The Fisher iris data set [16] consists of four measurements 
(petal length, petal width, sepal length, and sepal width) for 150 iris flowers, fifty 
each from three iris species (Iris setosa, Iris virginica, and Iris versicolor). It is 
well-documented that the setosa cluster is easily separable from the other two, but 
separating the virginica and versicolor species is more difficult |17j . 

When building S using NMF the choice of fc is limited to two or three since NMF 
requires fc to be less than both the dimension of the data and the number of samples. 
Running the multiplicative update version of NMF 100 times with k — 2 never results 
in a perfect splitting of setosa from the other two species, though there are three 
or fewer clustering errors 67 times. However, there are six instances of more than 
15 errors including a worst case of 26. Despite these problems, the SCA, using a 
consensus similarity matrix built from these rather poor results gets the clustering 
correct for all but three irises. Although NMF does quite poorly in trying to separate 
the irises into three clusters, the S derived from these results leads to a perfect two- 
cluster separation of setosa irises from virginica and versicolor ones. 

On the whole, individual clustering results on the iris data set using fc-means 
clustering with k — 2 or k = 3 are better than those returned by NMF. However, 
building S using the results from fc-means clustering, we get very similar results to 
what we saw with NMF. 

If we decide to build S using k = 4 just to see if it will give us any insight into the 
data set, SCA recognizes that there are three clusters in the data set, but 16 flowers 
are misclustered. Though that result may not seem encouraging, notice that this an 
improvement over the the range of errors (21 - 38) when using fc-means with fc = 3. 

Finally, the consensus matrices found using NMF and fc-means were summed to 
see if a more robust clustering than the one found by SCA using S from just one of 
these methods could be found. Notice that this approach proved fruitful as there is 
at most one error regardless of the value of fc used. 

The results from using all of these different consensus matrices are summarized 
in Table O 

Table 5.1: Clustering the iris data set, S created using NMF (first two lines), fc-means 
(next three lines), and a combination of the two (last two lines). 



Method and fc 


range of # of errors 


fc found 


# of errors 


used to create S 


in single clusterings 


by SCA 


in SCA result 


NMF (2) 


1-26 


2 


3 


NMF (3) 


19-72 


2 





fc-means (2) 


3 


2 


3 


fc-means (3) 


21-38 


2 





fc-means (4) 


n/a 


3 


16 


Combined (2) 


1-26 


2 


1 


Combined (3) 


19-72 


2 






5.2. Leukemia DNA microarray data set. In 1999 a paper was published 
analyzing a DNA microarray data set containing the gene expression values for 6817 
genes from 38 bone marrow samples |22) . Five years later, the same 38 samples 
were examined, though this time only 5000 genes were used [7J. The samples came 
from leukemia patients who had all been diagnosed with either acute lymphoblastic 
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leukemia (ALL) or acute myeloid leukemia (AML). Additionally, the ALL patients 
had either the B-cell or T-cell subtype of the disease (ALL-B or ALL-T). This data set 
is well known in the academic community (Google Scholar reports that the 1999 paper 
has been cited over 6000 times) and is an excellent test for new clustering algorithms 
since it can be divided into either two (ALL/AML) or three (ALL-B/ ALL-T/ AML) 
clusters. The actual clustering for the leukemia data set is known (see Table 5.2), 
though the 2004 paper noted that the data "contains two ALL samples that are 
consistently misclassified or classified with low confidence with most methods. There 
are a number of possible explanations for this, including incorrect diagnosis of the 
samples [7]." 



Table 5.2: The correct clustering of the leukemia DNA microarray data set. 



Diagnosis 


Patients 


ALL-B 
ALL-T 
AML 


1 - 19 
20 - 27 
28 - 38 



Since the 2004 paper was published to demonstrate the effectiveness of nonnega- 
tive matrix factorization in clustering this data set, this seems to be an appropriate 
test for the stochastic clustering algorithm, using NMF with different k values to 
build the ensemble. The data set was clustered using NMF 100 times each for k = 2 
and k = 3. Additionally, to explore the data set further, the data were clustered an 
additional 100 times for k = 4, 5 and 6. 

Figure [5 . 1 a| shows the number of errors for each clustering used in building S2 , the 
k — 2 consensus similarity matrix. NMF is clearly quite good at clustering this data 
set into two clusters, which was the point of [7]. Each time the stochastic clustering 
algorithm is used to cluster the patients based on S2, it makes exactly two errors - 
misclustering Patients 6 and 29. 

Similar comparisons were done using S3, the k = 3 consensus similarity matrix, 
and again the stochastic clustering method could not improve on the already excellent 
results of NMF. NMF made an average of 3.18 errors per clustering compared to 4.76 
for the SCA. Even the hope that the SCA would provide a narrower band of errors 
than NMF is not realized (see Table 5.1b I. Perhaps the lesson is that if the original 
method does a good job of clustering, then SCA is not likely to improve on it, though 
it is also not likely to worsen it. 

Since cluster analysis is an exploratory tool, consensus matrices S4, S5, and Sg 
were constructed to see if either the stochastic clustering algorithm or nonnegative 
matrix factorization could discover some hidden structure in the data set that would 
indicate one or more undiscovered clusters. If a group of elements all break away 
from an existing cluster or clusters, there is reason for further investigation regarding 
a new cluster. Interestingly, when k = 4, the results from both NMF and the SCA 
agree. As Table [5Tc| summarizes, they both have identified a fourth cluster made up 
of four ALL-B patients and two AML patients. 

Neither of the methods give any indication of further clusters. When k = 5 or 
k = 6 both methods begin to build two or three large clusters with the remaining 
clusters containing only two or three members. 

Before we move on to the next data set, there is one other interesting result to 
report. If the stochastic clustering algorithm is run using the sum of S2 and S3 it 
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# of Errors 


1 


2 


3 


4 


# of Instances (NMF) 


30 


65 


3 


2 


# of Instances (SCA) 





100 









(a) The leukemia DNA microarray data set was clus- 
tered 100 times using NMF with k = 2. The number 
of errors ranged between one and four. When the SCA 
was used on the consensus matrix created from those 
100 NMF clusterings, it mis-clustered Patients 6 and 
29 each time. 



# of Errors 


1 


2 


3 


4 


5 


6 


7 8 


9 10+ 


# of Instances (NMF) 





71 


3 


9 


3 


3 


1 2 


8 


# of Instances (SCA) 





67 

















33 



(b) Neither the SCA nor NMF shows an advantage over the other when clustering 
the consensus matrix S3. 



Diagnosis 


Patients 


Patients 


ALL-B 


1 - 19 


1, 3, 5, 7 - 9, 11 - 14, 16 - 18 


ALL-T 


20 - 27 


10, 20 - 27 


AML 


28 - 38 


28, 30 - 35, 37, 38 


New Cluster 




4, 6, 19, 29, 36 



(c) Both NMF and SCA agree that there may be a new cluster. The 
third column shows the membership of this new cluster and the patients 
remaining in the other three. 



Fig. 5.1: This is a collection of tables that compare the results of clustering consensus 
matrices constructed using different k- values. The consensus matrices were clustered 
by both the SCA and NMF. Table |5~Ta| compares the results for k = 2. Table |5~Tb| 
shows very little difference between the two methods when k = 3. Table 5.1c shows a 
possible fourth cluster suggested by both NMF and SCA. 



identifies two clusters and makes only one clustering mistake, namely Patient 29jj 

5.3. Custom clustering. As we first mentioned in Section [4~T] the fact that the 
stochastic clustering algorithm uses a random initial probability vector means that it 
can arrive at different solutions, and when clustering the leukemia data set we found 
this to be so. While this might be viewed as a weakness of the algorithm, it does 
give the researcher the ability to answer a very specific question by creating a specific 
initial probability vector. 

In Section |5.2[ we noticed that the SCA did not cluster the leukemia data set 
consensus matrix any better than nonnegative matrix factorization. But what if our 
primary interest was not in clustering the entire data set, but instead in finding the 
membership of the cluster of a particular data point. For example, if you are the 
physician for Patient number 2 you have limited interest in a global view of the 
leukemia data set. Indeed, rather than knowing which of the three clusters Patient 
2 belonged to, it would be of greater use to you to know a small number of other 



4 Throughout the research period for this paper, the Patient 29 sample was misclustered nearly 
100 per cent of the time. One of the authors of the 2004 paper verifies that in their work, the Patient 
29 sample was also often placed in the wrong cluster 48 . 
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Custom Clustering Algorithm (CCA) 

1. Create the consensus similarity matrix S and the doubly stochastic sym- 
metric matrix P just as in the stochastic clustering algorithm. 

2. Construct Xq to contain all zeros except for a one in the place of the element 
we are interested in creating a custom cluster for. 

3. Pass the algorithm values for the minimum and maximum size cluster you 
desire and the maximum number of iterations the CCA should take trying 
to find that cluster. 

4. After each xf = xj P multiplication, cluster the elements of xf as in the 
SCA. If the cluster containing the target element is within the size param- 
eters, output the cluster and end the program. 



Fig. 5.2: The Custom Clustering Algorithm 



patients that are most like Patient 2 in the hope that that knowledge would help you 
tailor the best treatment plan possible. 

To create such a custom clustering, we construct an IPV containing all zeros 
except for a 1 in the place corresponding to our data point of interest. We then ask 
the stochastic clustering algorithm to find the cluster containing our specific data 
point. Since we may be interested in a collection much smaller than that cluster, the 
stochastic clustering algorithm can be modified to ask for a small number data points 
whose Xt entries are closest to our target point. 

Here again we find hope in a feature of the SCA that seemed to disappoint 
us in Section |5.2| In that section, the clustering of consensus matrices built from 
methods using k = 5 and k = 6 seemed to supply new information. In fact, the 
small clusters found then are indicative of an especially close relationship between the 
cluster members. 

Incorporating these ideas using the consensus matrix Sq from Section |5.2| and an 
initial probability vector of all zeros except for a 1 in the second position gives us 
the custom cluster of {2, 4, 6, 15, 19, 29, 36}, a cluster with four other AML-B patients 
and two AML patients (although one of them, Patient 29, consistently clusters with 



the AML-B patients in our experience). These results are presented in table 5.3 
along with the six nearest neighbors of Patient 2 using Euclidean distance and cosine 
measure. The SCA's custom cluster for Patient 2 features three patients not found in 
these nearest neighbor sets and suggests that physicians could learn a great deal by 
examining these hidden connections between Patient 2 and Patients 15, 29, and 36. 

6. Discussion. These initial tests prove that the SCA can be an effective clus- 
tering tool. As with any new method, this initial promise raises multiple questions 
for further study, some of which are listed here. 

• Use probabilistic analysis of initial probability vectors to see what we can 
learn about the number of possible solutions the SCA can return and whether 
there is any connection between cr(P, n\) and the tendency of P to produce 
multiple solutions. 

• Devise a fuzzy clustering for a data set based on the multiple results returned 
when using different initial probability vectors. 

• Investigate whether in situations where the stochastic clustering algorithm 
returns multiple answers, if building a consensus matrix from these results, 
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Tabic 5.3: Custom Cluster for leukemia Patient 2. This table shows the six other 
patients most similar to Patient 2. The patients are listed in similarity order, that 
is, the first one is the one most similar to Patient 2. The cluster returned by the 
SCA differs by three patients with both lists derived from two traditional distance 
measures. 



Method 


Other Patients 


SCA 
2-norm 
cosine 


29,19,4,15,36,6 
19,16,9,3,6,18 
16,19,9,3,18,4 



and applying the SCA again will eventually yield a unique solution. 
Examine whether the Sinkhorn-Knopp balancing step can be replaced by a 
simple scaling to make all row sums equal. Though we lose the results from 
Markov chain theory, perhaps they are unneeded since all we are looking 
for is xf values that are approximately equal. The work of Lin and Cohen 
mentioned in Section [4.4| would seem to indicate that this is a possibility. 
Continue the search for a single similarity measure whose values are dis- 
tributed in a way that can be exploited by the stochastic clustering method. 
Improve the bounds for values of di. Numerical results indicate that the 
upper bound found for Theorem |2.13| can be greatly improved. 
Explore the structure of the spectrum of symmetric, irreducible, nearly un- 
coupled, doubly stochastic matrices. For this paper, we were only concerned 
with the eigenvalues near one, but from examining eigenvalues during the 
course of this research, there appears to be some structure to the spectrum, 
especially a large number of eigenvalues near zero. 

Work to find a tighter bound on the numeric connection between \2(P) and 



a(P,ni) that Theorems 2.15 and 2.16 establishes. 
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